Abstract — Combinatorial network optimization algorithms
that compute optimal structures taking into account edge weights
form the foundation for many network protocols. Examples
include shortest path routing, minimum spanning tree computation,
maximum weighted matching on bipartite graphs, etc. We present
CLRMR, the first online learning algorithm that efficiently solves
the stochastic version of these problems where the underlying
edge weights vary as independent Markov chains with unknown
dynamics.

The performance of an online learning algorithm is characterized
in terms of regret, defined as the cumulative difference
in rewards between a suitably defined genie and that obtained
by the given algorithm. We prove that, compared to a genie that
knows the Markov transition matrices and uses the single-best
structure at all times, CLRMR yields regret that is polynomial
in the number of edges and nearly logarithmic in time.

I. Introduction 

The following abstract description of combinatorial network 
optimization covers many graph theoretic algorithms that form 
the basis of network protocol design in wired and wireless 
networks. Given a graph $G = (V, E)$, where each edge $e \in E$
is associated with a weight $w_e$, find a structure consisting of
a collection of edges satisfying some given property (e.g., 
a path, a tree, a matching, or an independent set), that 
maximizes or minimizes the sum of the weights on the selected 
edges. This kind of linear network combinatorial optimization 
covers, for instance, shortest path and minimum spanning tree 
computation used in routing protocols, and maximum-weight 
matching used for channel scheduling and switching. 
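As a concrete instance of the deterministic problem described above, the following minimal sketch finds a minimum-weight path when the edge weights are known. It uses a textbook Dijkstra search; the function name `shortest_path` and the toy graph are illustrative assumptions and do not come from the paper.

```python
import heapq

def shortest_path(edges, source, target):
    """Dijkstra on an undirected graph given as {(u, v): weight} with weights >= 0.

    Returns (path, cost); path is None if target is unreachable.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

# Hypothetical toy graph: the selected "structure" is the set of edges on the path.
edges = {("s", "a"): 1.0, ("a", "t"): 1.0, ("s", "t"): 3.0,
         ("s", "b"): 0.5, ("b", "t"): 2.0}
path, cost = shortest_path(edges, "s", "t")
```

In the stochastic setting studied here the weights are not known in advance, so a solver of this kind would be applied to estimated weights rather than true ones.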

In practice, the edge weights may correspond to some 
link quality metric of interest such as packet reception ratio, 
delay, or throughput. In such a case, the edge weights are 
often stochastically varying with time. Moreover, the dynamics 
may not be known a priori. The solution approach to this 
problem that we advocate here is to combine the estimation 
and optimization phases jointly via an efficient online learning 
algorithm. 

We present in this paper an online learning algorithm 
that is designed for the setting where the edge weights are 
modeled by finite-state Markov chains, with unknown tran- 
sition matrices. We show that this problem can be modeled 
as a combinatorial multi-armed bandit problem with restless 
Markovian rewards. 

To characterize the performance of this algorithm, following
the convention in the multi-armed bandit literature, we use
a notion of regret, defined as the difference in reward between
a suitably defined model-aware genie and that accumulated
by the given algorithm over time. Specifically, in this work,
we consider a single-action regret formulation, whereby the
genie is assumed to know the transition matrices for all edges,
but is constrained to stick with one action (corresponding to
a particular network structure) at all times.^1 We prove that
our algorithm, which we refer to as CLRMR (Combinatorial 
Learning with Restless Markov Rewards) achieves a regret 
that is polynomial in the number of Markov chains (i.e., 
number of edges), and logarithmic with time. This implies 
that our learning algorithm, which does not know the transition 
matrices, asymptotically achieves the maximum time averaged 
reward possible with any single-action policy, even if that 
policy is given advanced knowledge of the transition matrices. 
By contrast, the conventional approach of estimating the mean 
of each edge weight and then finding the desired network 
structure via deterministic optimization would incur greater 
overhead and provide only linearly increasing regret over time, 
which is not asymptotically optimal.

While recent work has shown how to address multi-armed
bandits with restless Markovian rewards in the classic
non-combinatorial setting [1], and combinatorial multi-armed
bandits in the simpler settings of i.i.d. rewards [2] or rested
Markovian rewards [3], this paper is the first to show how to
efficiently implement online learning for stochastic combinatorial
network optimization when edge weights are dynamically
evolving as restless Markovian processes. We perform
simulations to evaluate our new algorithm over two combinatorial
network optimization problems: stochastic shortest path routing
and bipartite matching for channel allocation, and show
that its regret performance is substantially better than that
of the algorithm presented in [1], which can handle restless
Markovian rewards but does not exploit the dependence
between the arms, resulting in a regret that grows exponentially
in the number of unknown variables.

The rest of the paper is organized as follows. We first
provide a survey of prior work in Section II. We then present a
formal model of the combinatorial restless multi-armed bandit
problems in Section III. In Section IV we present our CLRMR
policy, and show that it requires only polynomial storage. We
present our novel analysis of the regret of the CLRMR policy
in Section V. In Section VI we discuss examples and show
numerical simulation results, demonstrating that our proposed
policy is useful for various combinatorial
network optimization problems. We finally conclude our paper
in Section VII.

^1 Although a stronger notion of regret can be defined, allowing the genie
to vary the action at each time, the problem of minimizing such a stronger
regret is much harder and remains open even for simpler settings than the one
we consider here.



II. Related Work

We summarize below the related work, which has treated
a) temporally i.i.d. rewards, b) rested Markovian rewards, and
c) restless Markovian rewards.

A. Temporally i.i.d. rewards

Lai and Robbins [4] wrote one of the earliest papers on
the classic non-Bayesian infinite horizon multi-armed bandit
problem. They assume $K$ independent arms, each generating
rewards that are i.i.d. over time obtained from a distribution
that can be characterized by a single parameter. For this
problem, they present a policy that provides an expected
regret that is $O(K \log n)$, i.e. linear in the number of arms
and asymptotically logarithmic in $n$. Anantharam et al. extend
this work to the case when $M$ simultaneous plays are
allowed [5]. The work by Agrawal [6] presents easier to
compute policies based on the sample mean that also have
asymptotically logarithmic regret. The paper by Auer et al. [7]
considers arms with nonnegative rewards that are i.i.d.
over time with an arbitrary non-parameterized distribution whose
only restriction is that it has finite support. Further,
they provide a simple policy (referred to as UCB1), which
achieves logarithmic regret uniformly over time, rather than
only asymptotically. Our work utilizes a general
Chernoff-Hoeffding-bound-based approach to regret analysis pioneered
by Auer et al.

Some recent work has shown the design of distributed
multiuser policies for independent arms. Motivated by the
problem of opportunistic access in cognitive radio networks,
Liu and Zhao [8], Anandkumar et al. [9], [10], and Gai and
Krishnamachari [11] have developed policies for the problem
of $M$ distributed players operating $N$ independent arms.

Our work in this paper is closest to and builds on the recent
work by Gai et al. which introduced combinatorial multi-armed
bandits [2]. The formulation in [2] has the restriction
that the reward process must be i.i.d. over time. A polynomial
storage learning algorithm is presented in [2] that yields
regret that is polynomial in users and resources and uniformly
logarithmic in time for the case of i.i.d. rewards.

B. Rested Markovian rewards

There has been relatively less work on multi-armed bandits
with Markovian rewards. Anantharam et al. [12] wrote one
of the earliest papers with such a setting. They proposed a
policy to pick $m$ out of the $N$ arms each time slot and prove
the lower bound and the upper bound on regret. However, the
rewards in their work are assumed to be generated by rested
(i.e. rewards that only evolve when the arms are selected)
Markov chains with transition probability matrices defined by
a single parameter with identical state spaces. Also, for the
upper bound the result is achieved only asymptotically.

For the case of single users and independent arms, a recent
work by Tekin and Liu [13] has extended the results in [12],
relaxing the requirement of a single parameter and identical
state spaces across arms. They propose to use UCB1 from
[7] for the multi-armed bandit problem with rested Markovian
rewards and prove a logarithmic upper bound on the regret
under some conditions on the Markov chain.

In a recent work by Gai et al. [3], learning policies for
combinatorial multi-armed bandits with rested Markovian rewards
have been studied. Unlike [3], we adopt a model with restless
Markovian rewards, which has much broader applications in
many network optimization problems.

C. Restless Markovian rewards 

Restless arm bandits are so named because the arms evolve 
at each time, changing state even when they are not selected. 
Work on restless Markovian rewards with single users and
independent arms can be found in [1], [14]-[16]. These
papers do not consider possible dependencies
among arms, unlike our work here.

Tekin and Liu [1] have proposed a policy named RCA that achieves
logarithmic single-action regret when certain knowledge about
the system is known. We use elements of the policy and
proof from [1] in this work, which is however quite different
in its combinatorial matching formulation (which allows for
dependent arms). Liu et al. [14], [15] adopted the same
problem formulation as in [1], and proposed a policy named
RUCB, achieving a logarithmic single-action regret over time
when certain system knowledge is known. They also extend
the RUCB policy to achieve a near-logarithmic regret
asymptotically when no knowledge about the system is available.

Dai et al. in [16] adopt a stronger definition of regret: 
the difference in expected reward compared to a model- 
aware genie. They develop a policy that yields regret of order 
arbitrarily close to logarithmic for certain classes of restless 
bandits with a finite-option structure, such as restless MAB 
with two states and identical probability transition matrices. 

III. Problem Formulation 

We consider a system with $N$ edges predefined by some
application, where time is slotted and indexed by $n$. For each
edge $i$ ($1 \le i \le N$), there is an associated state that evolves
as a discrete-time, finite-state, aperiodic, irreducible Markov
chain^2 $\{X^i(n), n \ge 0\}$ with unknown parameters.^3 We denote
the state space for the $i$-th Markov chain by $S^i$. We assume
these $N$ Markov chains are mutually independent. The reward
obtained from state $x$ ($x \in S^i$) of Markov chain $i$ is denoted as
$r^i_x$. Denote by $\pi^i_x$ the steady state distribution for state $x$. The
mean reward obtained on Markov chain $i$ is denoted by $\mu^i$.
Then we have $\mu^i = \sum_{x \in S^i} r^i_x \pi^i_x$. The set of all mean rewards
is denoted by $\mu = \{\mu^i\}$.

^2 We also refer to Markov chain $\{X^i(n), n \ge 0\}$ and Markov chain $i$
interchangeably.

^3 Alternatively, for Markov chain $\{X^i(n), n \ge 0\}$, it suffices to assume
that the multiplicative symmetrization of the transition probability matrix is
irreducible.
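To make the mean reward of a single edge concrete, the sketch below computes the steady state distribution of a small aperiodic, irreducible chain by power iteration and then evaluates the sum of reward times steady state probability over its states. The function names and the two-state example are assumptions for illustration; the learner in this paper does not have access to the transition matrix.

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for the steady state distribution of a row-stochastic
    matrix P (list of lists); assumes the chain is aperiodic and irreducible."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

def mean_reward(P, rewards):
    """Mean reward of one chain: sum over states x of r_x * pi_x."""
    pi = stationary_distribution(P)
    return sum(r * p for r, p in zip(rewards, pi))

# Hypothetical two-state link: state 0 pays 0, state 1 pays 1.
P = [[0.7, 0.3],
     [0.6, 0.4]]
mu = mean_reward(P, [0.0, 1.0])  # steady state is (2/3, 1/3), so mu is about 1/3
```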

At each decision period $n$ (also referred to interchangeably
as time slot), an $N$-dimensional action vector $a(n)$,
representing an arm, is selected under a policy $\phi(n)$ from
a finite set $\mathcal{F}$. We assume $a_i(n) \ge 0$ for all $1 \le i \le N$.
When a particular $a(n)$ is selected, the value of $r^i_{x^i(n)}$ is
observed, only for those $i$ with $a_i(n) \ne 0$. We denote by
$\mathcal{A}_{a(n)} = \{i : a_i(n) \ne 0, 1 \le i \le N\}$ the index set of all
$a_i(n) \ne 0$ for an arm $a$. We treat each $a(n) \in \mathcal{F}$ as an arm.
The reward is defined as:

\[ R_{a(n)}(n) = \sum_{i \in \mathcal{A}_{a(n)}} a_i(n)\, r^i_{x^i(n)} \tag{1} \]

where $x^i(n)$ denotes the state of Markov chain $i$ at time $n$.

When a particular arm $a(n)$ is selected, the rewards corresponding
to non-zero components of $a(n)$ are revealed, i.e.,
the value of $r^i_{x^i(n)}$ is observed for all $i$ such that $a_i(n) \ne 0$.

The state of the Markov chain evolves restlessly, i.e., the
state will continue to evolve independently of the actions. We
denote by $P^i = (p^i_{x,y})_{x,y \in S^i}$ the transition probability matrix
for the Markov chain $i$. We denote by $(P^i)' = \{(p^i)'_{x,y}\}_{x,y \in S^i}$
the adjoint of $P^i$ on $l_2(\pi)$, so $(p^i)'_{x,y} = \pi^i_y p^i_{y,x} / \pi^i_x$. Denote
$\hat{P}^i = (P^i)' P^i$ as the multiplicative symmetrization of $P^i$. For
aperiodic irreducible Markov chains, the $\hat{P}^i$'s are irreducible [17].

A key metric of interest in evaluating a given policy $\phi$
for this problem is regret, which is defined as the difference
between the expected reward that could be obtained by the
best-possible static action, and that obtained by the given
policy. It can be expressed as:

\[ \mathfrak{R}^{\phi}(n) = n\gamma^* - E^{\phi}\Big[ \sum_{t=1}^{n} \sum_{i \in \mathcal{A}_{a(t)}} a_i(t)\, r^i_{x^i(t)} \Big] \tag{2} \]

where $\gamma^* = \max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} a_i \mu^i$ is the expected reward of the
optimal arm. For the rest of the paper, we use $*$ as the index
indicating that a parameter is for an optimal arm. If there
is more than one optimal arm, $*$ refers to any one of them.
We denote by $\gamma^a$ the expected reward of arm $a$, so
$\gamma^a = \sum_{i \in \mathcal{A}_a} a_i \mu^i$.

For this combinatorial multi-armed bandit problem with
restless Markovian rewards, our goal is to design policies that
perform well with respect to regret. Intuitively, we would like
the regret $\mathfrak{R}^{\phi}(n)$ to be as small as possible. If it is sublinear
with respect to time $n$, the time-averaged regret will tend to
zero.
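The quantities in the regret definition can be illustrated with a small sketch. It assumes the mean rewards of the chains are known and that the expected reward of playing arm a in a slot equals its expected reward in steady state, which is a simplification of the expectation in the regret definition; the names `arm_mean` and `steady_state_regret` and all numbers are hypothetical.

```python
def arm_mean(a, mu):
    """Expected reward of arm a: sum of a_i * mu_i over its non-zero components."""
    return sum(ai * mi for ai, mi in zip(a, mu) if ai != 0)

def steady_state_regret(arms, mu, played):
    """Regret of a play sequence against the best single arm, under the
    simplifying assumption that each play of arm a earns its mean on average."""
    gamma_star = max(arm_mean(a, mu) for a in arms)
    return sum(gamma_star - arm_mean(arms[k], mu) for k in played)

mu = [0.5, 0.25, 0.75]                       # hypothetical mean rewards of 3 chains
arms = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]     # hypothetical feasible set
regret = steady_state_regret(arms, mu, [0, 1, 2, 2])
```

Here the arm means are 0.75, 1.0 and 1.25, so the two suboptimal plays contribute 0.5 and 0.25 and the plays of the best arm contribute nothing.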



[Figure: timeline of consecutive blocks in which arms a(n-1), a(n) and a(n+1) are played; an index is computed before each block, and each block is split into sub-blocks SB1, SB2 and SB3.]

Fig. 1. An illustration of CLRMR


IV. Policy Design 

For the above combinatorial MAB problem with restless 
rewards, we have two challenges here for the policy design: 

(1) A straightforward idea is to apply RCA in [1], or RUCB
in [14] directly and naively, and ignore the dependencies
across the different arms. However, we note that RCA and
RUCB both require storage and computation time that
are linear in the number of arms. Since there could be
exponentially many arms in this formulation, this is highly
unsatisfactory.

(2) Unlike our prior work on combinatorial MAB with 
rested rewards, for which the transitions only occur each time 
the Markov chains are observed, the policy design for the 
restless case is much more difficult, since the current state 
while starting to play a Markov chain depends not only on 
the transition probabilities, but also on the policy. 

To deal with the first challenge, we want to design a policy
which more efficiently stores observations from the correlated
arms, and exploits the correlations to make better decisions.
Instead of storing the information for each arm, our idea is
to use two 1 by $N$ vectors to store the information for each
Markov chain. Then an index for each arm is calculated,
based on the information stored for underlying components.
This index is used for choosing the arm to be played each time
a decision needs to be made.

To deal with the second challenge, for each arm $a$ we
note that the multidimensional Markov chain $\{X^a(n), n \ge 0\}$
defined by underlying components as $X^a(n) = (X^i(n))_{i \in \mathcal{A}_a}$
is aperiodic and irreducible. Instead of utilizing the actual
sample path of all observations, we only take the observations
from a regenerative cycle for Markov chains and discard the
rest in the estimation of the index.

Our proposed policy, which we refer to as Combinatorial
Learning with Restless Markov Reward (CLRMR), is shown
in Algorithm 1. Table I summarizes the notation we use for
the CLRMR algorithm. For Algorithm 1, $(x^i)_{i \in \mathcal{A}_a} = (\zeta^i)_{i \in \mathcal{A}_a}$
means $x^i = \zeta^i, \forall i$.

CLRMR operates in blocks. Figure 1 illustrates one possible
realization of Algorithm 1. At the beginning of each block,
an arm $a$ is picked and within one block, this algorithm always
plays the same arm. For each Markov chain $\{X^i(n)\}$, we
specify a state $\zeta^i$ at the beginning of the algorithm as a state
to mark the regenerative cycle. Then, for the multidimensional
Markov chain $\{X^a(n)\}$ associated with this arm, the state
$(\zeta^i)_{i \in \mathcal{A}_a}$ is used to define a regenerative cycle for $\{X^a(n)\}$.

Each block is broken into three sub-blocks denoted by SB1,
SB2 and SB3. In SB1, the selected arm $a$ is played until the
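Algorithm 1 Combinatorial Learning with Restless Markov
Reward (CLRMR)

 1: // INITIALIZATION
 2: $t = 1$, $t_2 = 1$;
 3: $\forall i = 1, \cdots, N$, $m_2^i = 0$, $\bar{z}_2^i = 0$;
 4: for $b = 1$ to $N$ do
 5:     $t := t + 1$, $t_2 := t_2 + 1$;
 6:     Play any arm $a$ such that $b \in \mathcal{A}_a$; denote $(x^i)_{i \in \mathcal{A}_a}$ as
        the observed state vector for arm $a$;
 7:     $\forall i \in \mathcal{A}_{a(n)}$, let $\zeta^i$ be the first state observed for
        Markov chain $i$ if $\zeta^i$ has never been set;
        $\bar{z}_2^i := \frac{\bar{z}_2^i m_2^i + r^i_{x^i}}{m_2^i + 1}$, $m_2^i := m_2^i + 1$;
 8:     while $(x^i)_{i \in \mathcal{A}_a} \ne (\zeta^i)_{i \in \mathcal{A}_a}$ do
 9:         $t := t + 1$, $t_2 := t_2 + 1$;
10:         Play arm $a$; denote $(x^i)_{i \in \mathcal{A}_a}$ as the observed state
            vector;
11:         $\forall i \in \mathcal{A}_{a(n)}$, $\bar{z}_2^i := \frac{\bar{z}_2^i m_2^i + r^i_{x^i}}{m_2^i + 1}$, $m_2^i := m_2^i + 1$;
12:     end while
13: end for
14: // MAIN LOOP
15: while 1 do
16:     // SB1 STARTS
17:     $t := t + 1$;
18:     Play an arm $a$ which maximizes

        \[ \max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} a_i \Big( \bar{z}_2^i + \sqrt{\frac{L \ln t_2}{m_2^i}} \Big); \tag{3} \]

19:     Denote $(x^i)_{i \in \mathcal{A}_a}$ as the observed state vector;
20:     while $(x^i)_{i \in \mathcal{A}_a} \ne (\zeta^i)_{i \in \mathcal{A}_a}$ do
21:         $t := t + 1$;
22:         Play an arm $a$ and denote $(x^i)_{i \in \mathcal{A}_a}$ as the observed
            state vector;
23:     end while
24:     // SB2 STARTS
25:     $t_2 := t_2 + 1$;
26:     $\forall i \in \mathcal{A}_{a(n)}$, $\bar{z}_2^i := \frac{\bar{z}_2^i m_2^i + r^i_{x^i}}{m_2^i + 1}$, $m_2^i := m_2^i + 1$;
27:     while $(x^i)_{i \in \mathcal{A}_a} \ne (\zeta^i)_{i \in \mathcal{A}_a}$ do
28:         $t := t + 1$, $t_2 := t_2 + 1$;
29:         Play an arm $a$ and denote $(x^i)_{i \in \mathcal{A}_a}$ as the observed
            state vector;
30:         $\forall i \in \mathcal{A}_{a(n)}$, $\bar{z}_2^i := \frac{\bar{z}_2^i m_2^i + r^i_{x^i}}{m_2^i + 1}$, $m_2^i := m_2^i + 1$;
31:     end while
32:     // SB3 is the last play in the while loop.
        Then a block completes.
33:     $b := b + 1$, $t := t + 1$;
34: end while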



state $(\zeta^i)_{i \in \mathcal{A}_a}$ is observed. Upon this observation we enter a
regenerative cycle, and continue playing the same arm until
$(\zeta^i)_{i \in \mathcal{A}_a}$ is observed again. SB2 includes all time slots from
the first visit of $(\zeta^i)_{i \in \mathcal{A}_a}$ up to but excluding the second
visit to $(\zeta^i)_{i \in \mathcal{A}_a}$. SB3 consists of a single time slot with the



TABLE I
Notation for Algorithm 1

$N$ : number of resources
$a$ : vectors of coefficients, defined on set $\mathcal{F}$; we map each $a$ as an arm
$\mathcal{A}_a$ : $\{i : a_i \ne 0, 1 \le i \le N\}$
$t$ : current time slot
$t_2$ : number of time slots in SB2 up to the current time slot
$b$ : number of blocks up to the current time slot
$m_2^i$ : number of times that Markov chain $i$ has been observed during SB2 up to the current time slot
$\bar{z}_2^i$ : average (sample mean) of all the observed values of Markov chain $i$ during SB2 up to the current time slot
$\zeta^i$ : state that determines the regenerative cycles for Markov chain $i$
$x^i$ : the observed state when Markov chain $i$ is played; $(x^i)_{i \in \mathcal{A}_a}$ is the observed state vector if arm $a$ is played



second visit to $(\zeta^i)_{i \in \mathcal{A}_a}$. SB1 is empty if the first observed
state is $(\zeta^i)_{i \in \mathcal{A}_a}$. So SB2 includes the observed rewards for
a regenerative cycle of the multidimensional Markov chain
$\{X^a(n)\}$ associated with arm $a$, which implies that SB2 also
includes the observed rewards for one or more regenerative
cycles for each underlying Markov chain $\{X^i(n)\}$, $i \in \mathcal{A}_a$.
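The sub-block structure described above can be sketched as a split of one block's observed joint states around the first two visits to the regenerative state vector. The helper name `split_block`, the representation of joint states as tuples, and the example trace are assumptions for illustration only.

```python
def split_block(trace, zeta):
    """Split the joint states observed in one block into (SB1, SB2, SB3).

    trace: list of joint state tuples, in play order, containing at least
           two visits to zeta (list.index raises ValueError otherwise).
    zeta:  the joint regenerative state for the arm played in this block.
    """
    first = trace.index(zeta)                # first visit ends SB1 and starts SB2
    second = trace.index(zeta, first + 1)    # second visit is the single SB3 slot
    return trace[:first], trace[first:second], trace[second:second + 1]

# Hypothetical trace for an arm built on two binary chains, with zeta = (0, 0).
trace = [(0, 1), (1, 1), (0, 0), (1, 0), (0, 1), (0, 0)]
sb1, sb2, sb3 = split_block(trace, (0, 0))
```

Only the observations in `sb2` would be used to update the per-chain statistics; `sb1` is empty when the block starts in the regenerative state.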

The key to Algorithm 1 is to store the observations for
each Markov chain instead of the whole arm, to utilize
only the observations in SB2, and to virtually assemble
them (highlighted with thick lines in Figure 1). Due to
the regenerative nature of the Markov chain, by putting the
observations in SB2 together, the sample path has exactly the same
statistics as given by the transition probability matrix. So the
problem is tractable.

The CLRMR policy requires storage linear in $N$. We use two 1
by $N$ vectors to store the information for each Markov chain
after we play the selected arm at each time slot in SB2. One is
$(\bar{z}_2^i)_{1 \times N}$, in which $\bar{z}_2^i$ is the average (sample mean) of observed
values in SB2 up to the current time slot (obtained through
potentially different sets of arms over time). The other one is
$(m_2^i)_{1 \times N}$, in which $m_2^i$ is the number of times that $\{X^i(n)\}$
has been observed in SB2 up to the current time slot.
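A minimal sketch of the two-vector bookkeeping follows: one list holds the running sample means and the other holds the observation counts, and both are updated only for the chains that belong to the arm just played in SB2. The function name `update_sb2` and the dictionary of observed rewards are illustrative assumptions.

```python
def update_sb2(z_bar, m, arm_support, observed_rewards):
    """Update per-chain sample means and counts after one SB2 time slot.

    z_bar, m:         lists of length N, modified in place
    arm_support:      indices i with a_i != 0 for the arm just played
    observed_rewards: mapping from chain index to the reward observed this slot
    """
    for i in arm_support:
        z_bar[i] = (z_bar[i] * m[i] + observed_rewards[i]) / (m[i] + 1)
        m[i] += 1

N = 3
z_bar, m = [0.0] * N, [0] * N
update_sb2(z_bar, m, [0, 2], {0: 1.0, 2: 0.5})   # arm covering chains 0 and 2
update_sb2(z_bar, m, [0], {0: 0.0})              # a different arm covering chain 0
```

Storage stays at two vectors of length N no matter how many arms the feasible set contains, since statistics are shared across all arms that include a given chain.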

Lines 1 to 13 are the initialization, for which each
Markov chain is observed at least once, and $\zeta^i$ is specified
as the first state observed for $\{X^i(n)\}$.

After the initialization, at the beginning of each block,
CLRMR selects the arm which solves the maximization problem
in (3). It is a deterministic linear optimization problem with
a feasible set $\mathcal{F}$, and the computation time for an arbitrary $\mathcal{F}$
may not be polynomial in $N$. But, as we show in Section VI,
there exist many practically useful examples with polynomial
computation time.
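The arm selection step can be sketched by brute-force enumeration of a small feasible set, which is only practical for toy instances; for structured problems such as shortest paths or matchings, a polynomial-time solver applied to the per-chain indices would take the place of the enumeration. The index form used below, sample mean plus the square root of L ln t_2 divided by the observation count, follows the maximization in (3); the function names and all numeric values (including L, which is not chosen to satisfy the condition of the regret theorems) are assumptions.

```python
import math

def clrmr_index(a, z_bar, m, t2, L):
    """Index of arm a: sum over its support of a_i * (z_bar_i + sqrt(L ln t2 / m_i)).

    Requires m[i] > 0 on the support, which the initialization phase ensures."""
    return sum(ai * (z_bar[i] + math.sqrt(L * math.log(t2) / m[i]))
               for i, ai in enumerate(a) if ai != 0)

def select_arm(arms, z_bar, m, t2, L):
    """Brute-force maximization of the index over an explicit list of arms."""
    return max(arms, key=lambda a: clrmr_index(a, z_bar, m, t2, L))

arms = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]
z_bar = [0.9, 0.1, 0.5]

# Well-explored chains and a small L: the empirically best arm wins.
exploit = select_arm(arms, z_bar, [100, 100, 100], t2=300, L=0.01)

# Chain 1 observed only once and a larger L: an arm containing it is favored.
explore = select_arm(arms, z_bar, [100, 1, 100], t2=300, L=2.0)
```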
V. Analysis of Regret

We summarize some notation we use in the description and
analysis of our CLRMR policy in Table II.

We first show in Theorem 1 an upper bound on the total
expected number of plays of suboptimal arms.

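Theorem 1: When using any constant $L \ge 56(H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, we have

\[ \sum_{a:\gamma^a<\gamma^*} (\gamma^* - \gamma^a)\, E[T^a(n)] \le Z_1 \ln n + Z_2 \]

where

\[ Z_1 = \Delta_{\max} \Big( \frac{1}{\Pi_{\min}} + M_{\max} + 1 \Big) \frac{4NLH^2 a_{\max}^2}{\Delta_{\min}^2}, \]

\[ Z_2 = \Delta_{\max} \Big( \frac{1}{\Pi_{\min}} + M_{\max} + 1 \Big) \Big( N + \frac{\pi N H S_{\max}}{3 \pi_{\min}} \Big). \]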
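To get a feel for the magnitude of the bound in Theorem 1, the following sketch evaluates the smallest admissible constant, 56(H+1) S_max^2 r_max^2 pi_hat_max^2 / eps_min, together with Z_1 = Delta_max (1/Pi_min + M_max + 1) 4 N L H^2 a_max^2 / Delta_min^2 and Z_2 = Delta_max (1/Pi_min + M_max + 1) (N + pi N H S_max / (3 pi_min)), as these constants are stated at the end of the proof. The function names and every parameter value are hypothetical; in practice these system quantities are unknown to the learner.

```python
import math

def min_exploration_constant(H, S_max, r_max, pi_hat_max, eps_min):
    """Smallest L allowed by the theorem's condition."""
    return 56 * (H + 1) * S_max**2 * r_max**2 * pi_hat_max**2 / eps_min

def theorem1_constants(N, H, L, a_max, S_max, delta_min, delta_max,
                       Pi_min, M_max, pi_min):
    """Constants Z1 and Z2 in the bound Z1 * ln(n) + Z2."""
    common = delta_max * (1.0 / Pi_min + M_max + 1.0)
    Z1 = common * 4 * N * L * H**2 * a_max**2 / delta_min**2
    Z2 = common * (N + math.pi * N * H * S_max / (3 * pi_min))
    return Z1, Z2

# Hypothetical toy system with two binary chains and single-edge arms.
L = min_exploration_constant(H=1, S_max=2, r_max=1.0, pi_hat_max=0.5, eps_min=0.5)
Z1, Z2 = theorem1_constants(N=2, H=1, L=L, a_max=1.0, S_max=2,
                            delta_min=1.0, delta_max=1.0,
                            Pi_min=0.5, M_max=1.0, pi_min=0.5)
```

Even for this tiny instance the constants are large, which reflects the conservative nature of the large-deviation bound rather than typical observed regret.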



To prove Theorem 1, we use the inequalities as stated in
Theorem 3.3 from [18] and a theorem from [19].

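Lemma 1 (Theorem 3.3 from [18]): Consider a finite-state,
irreducible Markov chain $\{X_t\}_{t \ge 1}$ with state space $S$,
matrix of transition probabilities $P$, an initial distribution $q$
and stationary distribution $\pi$. Let $N_q = \big\| \big( \frac{q_x}{\pi_x}, x \in S \big) \big\|_2$. Let
$\hat{P} = P'P$ be the multiplicative symmetrization of $P$ where
$P'$ is the adjoint of $P$ on $l_2(\pi)$. Let $\epsilon = 1 - \lambda_2$, where $\lambda_2$
is the second largest eigenvalue of the matrix $\hat{P}$. $\epsilon$ will be
referred to as the eigenvalue gap of $\hat{P}$. Let $f : S \to \mathbb{R}$ be
such that $\sum_{y \in S} \pi_y f(y) = 0$, $\|f\|_\infty \le 1$ and $0 < \|f\|_2^2 \le 1$.
If $\hat{P}$ is irreducible, then for any positive integer $n$ and all
$0 < \delta \le 1$

\[ P\Big( \frac{\sum_{t=1}^{n} f(X_t)}{n} \ge \delta \Big) \le N_q \exp\Big[ -\frac{n \delta^2 \epsilon}{28} \Big]. \]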
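The objects in Lemma 1 can be made concrete for a two-state chain. The sketch below builds the adjoint with entries pi_y p_{y,x} / pi_x, forms the multiplicative symmetrization as the product of the adjoint and the transition matrix, reads off the eigenvalue gap from the trace (valid only for 2 by 2 stochastic matrices, whose eigenvalues are 1 and trace minus 1), and evaluates the right-hand side N_q exp(-n delta^2 eps / 28). All function names and numbers are illustrative assumptions.

```python
import math

def adjoint(P, pi):
    """Adjoint of P on l2(pi): entry (x, y) is pi[y] * P[y][x] / pi[x]."""
    n = len(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def eigenvalue_gap_two_state(P):
    """Eigenvalue gap 1 - lambda_2 of the multiplicative symmetrization
    of a 2x2 row-stochastic matrix with positive off-diagonal entries."""
    p01, p10 = P[0][1], P[1][0]
    pi = [p10 / (p01 + p10), p01 / (p01 + p10)]   # closed-form steady state
    P_hat = matmul(adjoint(P, pi), P)
    lam2 = P_hat[0][0] + P_hat[1][1] - 1.0        # trace minus the eigenvalue 1
    return 1.0 - lam2

def deviation_bound(n, delta, eps, N_q):
    """Right-hand side of the large-deviation bound in Lemma 1."""
    return N_q * math.exp(-n * delta**2 * eps / 28.0)

gap = eigenvalue_gap_two_state([[0.7, 0.3], [0.6, 0.4]])
bound = deviation_bound(n=28, delta=1.0, eps=1.0, N_q=1.0)
```

For this chain the adjoint equals the transition matrix itself, so the symmetrization is its square and the second eigenvalue is 0.1 squared.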



Lemma 2: If $\{X_n\}_{n \ge 0}$ is a positive recurrent homogeneous
Markov chain with state space $S$, stationary distribution $\pi$ and
$\tau$ is a stopping time that is finite almost surely for which
$X_\tau = x$ then for all $y \in S$

\[ E\Big[ \sum_{t=0}^{\tau - 1} I(X_t = y) \,\Big|\, X_0 = x \Big] = E[\tau \mid X_0 = x]\, \pi_y. \]



Proof of Theorem 1:
We introduce $\widetilde{B}^i(b)$ as a counter for the regret analysis
to deal with the combinatorial arms. After the initialization
period, $\widetilde{B}^i(b)$ is updated in the following way: at the beginning
of any block when a non-optimal arm is chosen to be played,
find $i$ such that $i = \arg\min_{j \in \mathcal{A}_a} m_2^j$ (the index of the element
that has been observed least in SB2 among those
in the non-optimal arm). If there is only one such element, $\widetilde{B}^i(b)$ is
increased by 1. If there are multiple such elements, we arbitrarily
pick one, say $i'$, and increment $\widetilde{B}^{i'}$ by 1. Based on the above
definition of $\widetilde{B}^i(b)$, each time a non-optimal arm is chosen to
be played at the beginning of a block, exactly one element
in $(\widetilde{B}^i(b))_{1 \times N}$ is incremented by 1. So the summation of all



TABLE II
Notation for Regret Analysis

$H$ : $\max_a |\mathcal{A}_a|$. Note that $H \le N$
$a(\tau)$ : the arm played in time $\tau$
$b(n)$ : number of completed blocks up to time $n$
$t(b)$ : time at the end of block $b$
$t_2(b)$ : total number of time slots spent in SB2 up to block $b$
$B^a(b)$ : total number of blocks within the first $b$ blocks in which arm $a$ is played
$m_2^i(t_2(b))$ : total number of time slots Markov chain $i$ is observed during SB2 up to block $b$
$\bar{z}_2^i(s)$ : the mean reward from Markov chain $i$ when it is observed for the $s$-th time of only those times played during SB2
$T(n)$ : time at the end of the last completed block
$T^a(n)$ : total number of time slots arm $a$ is played up to time $T(n)$
$m_x^i(s)$ : number of times that state $x$ occurred when Markov chain $i$ has been observed $s$ times
$Y_1^i(j)$ : vector of observed states from SB1 of the $j$-th block for playing Markov chain $i$
$Y_2^i(j)$ : vector of observed states from SB2 of the $j$-th block for playing Markov chain $i$
$Y^i(j)$ : vector of observed states from the $j$-th block for playing Markov chain $i$
$\hat{\pi}_x^i$ : $\max\{\pi_x^i, 1 - \pi_x^i\}$
$\hat{\pi}_{\max}$ : $\max_{i, x \in S^i} \hat{\pi}_x^i$
$\pi_{\min}$ : $\min_{i, x \in S^i} \pi_x^i$
$\pi_{\max}$ : $\max_{i, x \in S^i} \pi_x^i$
$\epsilon^i$ : eigenvalue gap, defined as $1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of the multiplicative symmetrization of $P^i$
$\epsilon_{\min}$ : $\min_i \epsilon^i$
$S_{\max}$ : $\max_i |S^i|$
$r_{\max}$ : $\max_{i, x \in S^i} r_x^i$
$a_{\max}$ : $\max_{i, a \in \mathcal{F}} a_i$
$\Delta_a$ : $\gamma^* - \gamma^a$
$\Delta_{\min}$ : $\min_{\gamma^a < \gamma^*} \Delta_a$
$\Delta_{\max}$ : $\max_{\gamma^a < \gamma^*} \Delta_a$
$\{X^a(n)\}$ : multidimensional Markov chain defined by $X^a(n) = (X^i(n))_{i \in \mathcal{A}_a}$
$\zeta^a$ : $(\zeta^i)_{i \in \mathcal{A}_a}$, state vector that determines the regenerative cycles for $\{X^a(n)\}$
$\Pi_z^a$ : steady state distribution for state $z$ of $\{X^a(n)\}$
$\Pi_{\min}^a$ : $\min_{z \in S^a} \Pi_z^a$
$\Pi_{\min}$ : $\min_{a, z \in S^a} \Pi_z^a$
$M_{z_1, z_2}^a$ : mean hitting time of state $z_2$ starting from an initial state $z_1$ for $\{X^a(n)\}$
$M_{\max}^a$ : $\max_{z_1, z_2 \in S^a} M_{z_1, z_2}^a$
$\gamma_{\max}$ : $\max_{\gamma^a \le \gamma^*} \gamma^a$
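counters in $(\widetilde{B}^i(b))_{1 \times N}$ equals the total number of blocks in
which we have played non-optimal arms,

\[ \sum_{a:\gamma^a<\gamma^*} E[B^a(b)] = \sum_{i=1}^{N} E[\widetilde{B}^i(b)]. \tag{4} \]

We also have the following inequality for $\widetilde{B}^i(b)$:

\[ \widetilde{B}^i(b) \le m_2^i(t(b-1)), \quad \forall 1 \le i \le N, \forall b. \tag{5} \]

Denote by $c_{t,s} = \sqrt{L \ln t / s}$. Denote by $\widetilde{I}^i(b)$ the indicator
function which is equal to 1 if $\widetilde{B}^i(b)$ is incremented by one at block
$b$. Let $l$ be an arbitrary positive integer. Then we can get the
upper bound of $E[\widetilde{B}^i(b)]$ shown in (6),

\begin{align*}
E[\widetilde{B}^i(b)] &= \sum_{\beta=N+1}^{b} P\{\widetilde{I}^i(\beta) = 1\} \\
&\le l + \sum_{\beta=N+1}^{b} P\{\widetilde{I}^i(\beta) = 1, \widetilde{B}^i(\beta-1) \ge l\} \\
&\le l + \sum_{\beta=N+1}^{b} P\Big\{ \sum_{k \in \mathcal{A}_{a^*}} a^*_k\, g^k_{t_2(\beta-1),\, m_2^k(t(\beta-1))} \le \sum_{j \in \mathcal{A}_{a(\beta)}} a_j(\beta)\, g^j_{t_2(\beta-1),\, m_2^j(t(\beta-1))},\ \widetilde{B}^i(\beta-1) \ge l \Big\} \tag{6}
\end{align*}

where $g^j_{t,s} = \bar{z}_2^j(s) + c_{t,s}$ and $a(\beta)$ is defined as a non-optimal
arm picked at block $\beta$ when $\widetilde{I}^i(\beta) = 1$. Note that $m_2^i =
\min\{m_2^j : \forall j \in \mathcal{A}_{a(\beta)}\}$. We denote this arm by $a(\beta)$ since at
each block that $\widetilde{I}^i(\beta) = 1$, we could get different arms.

Note that $l \le \widetilde{B}^i(\beta-1)$ implies,

\[ l \le \widetilde{B}^i(\beta-1) \le m_2^j(t(\beta-1)), \quad \forall j \in \mathcal{A}_{a(\beta)}. \tag{7} \]

So we can further derive the upper bound of $E[\widetilde{B}^i(b)]$ shown
in (15), where $h_j$ ($1 \le j \le |\mathcal{A}_{a^*}|$) represents the $j$-th element
in $\mathcal{A}_{a^*}$; $p_j$ ($1 \le j \le |\mathcal{A}_{a(\beta)}|$) represents the $j$-th element in
$\mathcal{A}_{a(\beta)}$ or $\mathcal{A}_{a(\tau)}$. $a(\tau)$ represents the arm played in the $\tau$-th
time slot counting only in SB2.

\begin{align*}
E[\widetilde{B}^i(b)] &\le l + \sum_{\beta=N+1}^{b} P\Big\{ \min_{0 < s_{h_1}, \ldots, s_{h_{|\mathcal{A}_{a^*}|}} \le t_2(\beta-1)} \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} g^{h_j}_{t_2(\beta-1), s_{h_j}} \le \max_{l \le s_{p_1}, \ldots, s_{p_{|\mathcal{A}_{a(\beta)}|}} \le t_2(\beta-1)} \sum_{j=1}^{|\mathcal{A}_{a(\beta)}|} a_{p_j}(\beta) g^{p_j}_{t_2(\beta-1), s_{p_j}} \Big\} \\
&\le l + \sum_{\beta=N+1}^{b} \sum_{s_{h_1}=1}^{t_2(\beta-1)} \cdots \sum_{s_{h_{|\mathcal{A}_{a^*}|}}=1}^{t_2(\beta-1)} \sum_{s_{p_1}=l}^{t_2(\beta-1)} \cdots \sum_{s_{p_{|\mathcal{A}_{a(\beta)}|}}=l}^{t_2(\beta-1)} P\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} g^{h_j}_{t_2(\beta-1), s_{h_j}} \le \sum_{j=1}^{|\mathcal{A}_{a(\beta)}|} a_{p_j}(\beta) g^{p_j}_{t_2(\beta-1), s_{p_j}} \Big\} \\
&\le l + \sum_{\tau=1}^{t_2(b)} \sum_{s_{h_1}=1}^{\tau-1} \cdots \sum_{s_{h_{|\mathcal{A}_{a^*}|}}=1}^{\tau-1} \sum_{s_{p_1}=l}^{\tau-1} \cdots \sum_{s_{p_{|\mathcal{A}_{a(\tau)}|}}=l}^{\tau-1} P\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} g^{h_j}_{\tau, s_{h_j}} \le \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) g^{p_j}_{\tau, s_{p_j}} \Big\} \tag{15}
\end{align*}

Note that

\begin{align*}
& P\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} g^{h_j}_{\tau, s_{h_j}} \le \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) g^{p_j}_{\tau, s_{p_j}} \Big\} \tag{16} \\
&= P\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \big( \bar{z}_2^{h_j}(s_{h_j}) + c_{\tau, s_{h_j}} \big) \tag{17} \\
&\qquad \le \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) \big( \bar{z}_2^{p_j}(s_{p_j}) + c_{\tau, s_{p_j}} \big) \Big\} \tag{18} \\
&\le P\{\text{At least one of the following must hold:} \\
&\qquad \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \bar{z}_2^{h_j}(s_{h_j}) \le \gamma^* - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} c_{\tau, s_{h_j}}, \tag{19} \\
&\qquad \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) \bar{z}_2^{p_j}(s_{p_j}) \ge \gamma^{a(\tau)} + \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) c_{\tau, s_{p_j}}, \tag{20} \\
&\qquad \gamma^* < \gamma^{a(\tau)} + 2 \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) c_{\tau, s_{p_j}} \} \tag{21}
\end{align*}

Now we show the upper bound on the probabilities of inequalities
(19), (20) and (21) separately. We first find an upper bound
on the probability of (19):

\begin{align*}
& P\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \bar{z}_2^{h_j}(s_{h_j}) \le \gamma^* - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} c_{\tau, s_{h_j}} \Big\} \\
&= P\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \bar{z}_2^{h_j}(s_{h_j}) \le \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \mu^{h_j} - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} c_{\tau, s_{h_j}} \Big\} \\
&\le \sum_{j=1}^{|\mathcal{A}_{a^*}|} P\big\{ a^*_{h_j} \bar{z}_2^{h_j}(s_{h_j}) \le a^*_{h_j} ( \mu^{h_j} - c_{\tau, s_{h_j}} ) \big\} \\
&= \sum_{j=1}^{|\mathcal{A}_{a^*}|} P\big\{ \bar{z}_2^{h_j}(s_{h_j}) \le \mu^{h_j} - c_{\tau, s_{h_j}} \big\}.
\end{align*}

$\forall 1 \le j \le |\mathcal{A}_{a^*}|$,

\begin{align*}
& P\big\{ \bar{z}_2^{h_j}(s_{h_j}) \le \mu^{h_j} - c_{\tau, s_{h_j}} \big\} \\
&= P\Big\{ \sum_{x \in S^{h_j}} \frac{r_x^{h_j} m_x^{h_j}(s_{h_j})}{s_{h_j}} \le \sum_{x \in S^{h_j}} r_x^{h_j} \pi_x^{h_j} - c_{\tau, s_{h_j}} \Big\} \\
&\le \sum_{x \in S^{h_j}} P\Big\{ r_x^{h_j} m_x^{h_j}(s_{h_j}) - s_{h_j} r_x^{h_j} \pi_x^{h_j} \le -\frac{s_{h_j} c_{\tau, s_{h_j}}}{|S^{h_j}|} \Big\} \\
&= \sum_{x \in S^{h_j}} P\Big\{ m_x^{h_j}(s_{h_j}) - s_{h_j} \pi_x^{h_j} \le -\frac{s_{h_j} c_{\tau, s_{h_j}}}{r_x^{h_j} |S^{h_j}|} \Big\} \\
&= \sum_{x \in S^{h_j}} P\Big\{ \frac{\sum_{t=1}^{s_{h_j}} \mathbb{1}(Y_t^{h_j} \ne x) - s_{h_j}(1 - \pi_x^{h_j})}{\hat{\pi}_x^{h_j} s_{h_j}} \ge \frac{s_{h_j} c_{\tau, s_{h_j}}}{r_x^{h_j} |S^{h_j}| \hat{\pi}_x^{h_j} s_{h_j}} \Big\} \\
&\le \sum_{x \in S^{h_j}} N_{q^{h_j}}\, \tau^{-\frac{L \epsilon^{h_j}}{28 (|S^{h_j}| r_x^{h_j} \hat{\pi}_x^{h_j})^2}} \tag{22} \\
&\le \frac{S_{\max}}{\pi_{\min}}\, \tau^{-\frac{L \epsilon_{\min}}{28 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}} \tag{23}
\end{align*}

where (22) follows from Lemma 1 by letting

\[ \delta = \frac{c_{\tau, s_{h_j}}}{r_x^{h_j} |S^{h_j}| \hat{\pi}_x^{h_j}}, \qquad f(Y_t^{h_j}) = \frac{\mathbb{1}(Y_t^{h_j} \ne x) - (1 - \pi_x^{h_j})}{\hat{\pi}_x^{h_j}}. \]

$\mathbb{1}(a)$ is the indicator function defined to be 1 when the
predicate $a$ is true, and 0 when it is false. $\hat{\pi}_x^i$ is defined as
$\hat{\pi}_x^i = \max\{\pi_x^i, 1 - \pi_x^i\}$ to guarantee $\|f\|_\infty \le 1$. We note that
when $\delta > 1$ the deviation probability is zero, so the bound
still holds.

(23) follows from the fact that for any $q^i$,

\[ N_{q^i} = \Big\| \Big( \frac{q_x^i}{\pi_x^i}, x \in S^i \Big) \Big\|_2 \le \sum_{x=1}^{|S^i|} \Big\| \frac{q_x^i}{\pi_x^i} \Big\|_2 \le \frac{1}{\pi_{\min}}. \]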



Note that all the quantities in computing the indices and the
probabilities above come from SB2. For every SB2 in a
block, the quantities begin with state $\zeta^a$ and end with a return
to $\zeta^a$. So for each underlying Markov chain $\{X^i(n)\}$, $i \in \mathcal{A}_a$,
the quantities obtained begin with state $\zeta^i$ and end with a return
to $\zeta^i$. Note that for all $i$, Markov chain $\{X^i(n)\}$ could be
played in different arms, but the quantities obtained always begin
with state $\zeta^i$ and end with a return to $\zeta^i$. Then by the strong
Markov property, the process at these stopping times has the
same distribution as the original process. Connecting these
intervals together we form a continuous sample path which
can be viewed as a sample path generated by a Markov chain
with transition matrix identical to the original arm. This is the
reason why we can apply Lemma 1 to this Markov chain.
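Therefore,

\[ P\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \bar{z}_2^{h_j}(s_{h_j}) \le \gamma^* - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} c_{\tau, s_{h_j}} \Big\} \le \frac{H S_{\max}}{\pi_{\min}}\, \tau^{-\frac{L \epsilon_{\min}}{28 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}}. \tag{24} \]

With a similar derivation, we have

\begin{align*}
& P\Big\{ \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) \bar{z}_2^{p_j}(s_{p_j}) \ge \gamma^{a(\tau)} + \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) c_{\tau, s_{p_j}} \Big\} \\
&\le \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} P\big\{ a_{p_j}(\tau) \bar{z}_2^{p_j}(s_{p_j}) \ge a_{p_j}(\tau) \mu^{p_j} + a_{p_j}(\tau) c_{\tau, s_{p_j}} \big\} \\
&\le \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} \sum_{x \in S^{p_j}} P\Big\{ r_x^{p_j} m_x^{p_j}(s_{p_j}) - s_{p_j} r_x^{p_j} \pi_x^{p_j} \ge \frac{s_{p_j} c_{\tau, s_{p_j}}}{|S^{p_j}|} \Big\} \\
&= \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} \sum_{x \in S^{p_j}} P\Big\{ \frac{\sum_{t=1}^{s_{p_j}} \mathbb{1}(Y_t^{p_j} = x) - s_{p_j} \pi_x^{p_j}}{\hat{\pi}_x^{p_j} s_{p_j}} \ge \frac{s_{p_j} c_{\tau, s_{p_j}}}{r_x^{p_j} |S^{p_j}| \hat{\pi}_x^{p_j} s_{p_j}} \Big\} \\
&\le \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} \sum_{x \in S^{p_j}} N_{q^{p_j}}\, \tau^{-\frac{L \epsilon^{p_j}}{28 (|S^{p_j}| r_x^{p_j} \hat{\pi}_x^{p_j})^2}} \tag{25} \\
&\le \frac{H S_{\max}}{\pi_{\min}}\, \tau^{-\frac{L \epsilon_{\min}}{28 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}} \tag{26}
\end{align*}

where (25) follows from Lemma 1 by letting

\[ \delta = \frac{c_{\tau, s_{p_j}}}{r_x^{p_j} |S^{p_j}| \hat{\pi}_x^{p_j}}, \qquad f(Y_t^{p_j}) = \frac{\mathbb{1}(Y_t^{p_j} = x) - \pi_x^{p_j}}{\hat{\pi}_x^{p_j}}. \]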
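Note that when $l \ge \Big\lceil \frac{4 L H^2 a_{\max}^2 \ln t_2(b)}{\Delta_{a(\tau)}^2} \Big\rceil$, (21) is false for $\tau$, since

\begin{align*}
\gamma^* - \gamma^{a(\tau)} - 2 \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) c_{\tau, s_{p_j}}
&= \gamma^* - \gamma^{a(\tau)} - 2 \sum_{j=1}^{|\mathcal{A}_{a(\tau)}|} a_{p_j}(\tau) \sqrt{\frac{L \ln \tau}{s_{p_j}}} \\
&\ge \gamma^* - \gamma^{a(\tau)} - 2 H a_{\max} \sqrt{\frac{L \ln t_2(b)}{l}} \tag{27} \\
&\ge \gamma^* - \gamma^{a(\tau)} - 2 H a_{\max} \sqrt{\frac{L \ln t_2(b)\, \Delta_{a(\tau)}^2}{4 L H^2 a_{\max}^2 \ln t_2(b)}} \\
&= \gamma^* - \gamma^{a(\tau)} - \Delta_{a(\tau)} = 0. \tag{28}
\end{align*}

Hence, when we let $l \ge \Big\lceil \frac{4 L H^2 a_{\max}^2 \ln t_2(b)}{\Delta_{\min}^2} \Big\rceil$, (21) is false
for all $a(\tau)$. Therefore, we have (29).

Following (29),

\begin{align*}
E[\widetilde{B}^i(b)] &\le \frac{4 L H^2 a_{\max}^2 \ln n}{\Delta_{\min}^2} + 1 + \frac{H S_{\max}}{\pi_{\min}} \sum_{\tau=1}^{\infty} 2 \tau^{-\frac{L \epsilon_{\min} - 56 H S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}{28 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}} \tag{32} \\
&\le \frac{4 L H^2 a_{\max}^2 \ln n}{\Delta_{\min}^2} + 1 + \frac{\pi H S_{\max}}{3 \pi_{\min}} \tag{33}
\end{align*}

(33) follows since $L \ge 56(H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$.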
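The role of the condition on $L$ can be spelled out with a short counting argument. This is a sketch under the assumption that the bound on $E[\widetilde{B}^i(b)]$ contains at most $|\mathcal{A}_{a^*}| + |\mathcal{A}_{a(\tau)}| \le 2H$ nested sums over the indices $s_{h_j}$ and $s_{p_j}$, each ranging over at most $\tau$ values, so that the inner sums contribute a factor of at most $\tau^{2H}$:

```latex
% Sketch only: assumes at most 2H nested index sums, each over at most \tau values.
\begin{align*}
\sum_{s_{h_1}} \cdots \sum_{s_{p_{|\mathcal{A}_{a(\tau)}|}}}
  \frac{2 H S_{\max}}{\pi_{\min}}\,
  \tau^{-\frac{L \epsilon_{\min}}{28 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}}
&\le \frac{2 H S_{\max}}{\pi_{\min}}\,
  \tau^{\,2H - \frac{L \epsilon_{\min}}{28 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}}, \\
L \ge \frac{56 (H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}{\epsilon_{\min}}
&\iff 2H - \frac{L \epsilon_{\min}}{28 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2} \le -2 .
\end{align*}
```

Under this condition each term of the outer sum over $\tau$ decays at least as fast as $\tau^{-2}$, so the series converges to a constant that does not depend on $n$, leaving only the logarithmic term.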



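According to (4),

\begin{align*}
\sum_{a:\gamma^a<\gamma^*} E[B^a(b)] &= \sum_{i=1}^{N} E[\widetilde{B}^i(b)] \\
&\le \frac{4 N L H^2 a_{\max}^2 \ln n}{\Delta_{\min}^2} + N + \frac{\pi N H S_{\max}}{3 \pi_{\min}}.
\end{align*}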



Note that the total number of plays of arm $a$ at the end
of block $b(n)$ is equal to the total number of plays of arm
$a$ during SB2s (the regenerative cycles of visiting state $\zeta^a$)
plus the total number of plays before entering the regenerative
cycles plus one more play resulting from the last play of the
block which is state $\zeta^a$. So we have

\[ E[T^a(n)] \le \Big( \frac{1}{\Pi^a_{\min}} + M^a_{\max} + 1 \Big) E[B^a(b(n))]. \tag{34} \]



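Therefore,

\begin{align*}
\sum_{a:\gamma^a<\gamma^*} (\gamma^* - \gamma^a)\, E[T^a(n)]
&\le \Delta_{\max} \sum_{a:\gamma^a<\gamma^*} \Big( \frac{1}{\Pi^a_{\min}} + M^a_{\max} + 1 \Big) E[B^a(b(n))] \tag{35} \\
&\le \Delta_{\max} \Big( \frac{1}{\Pi_{\min}} + M_{\max} + 1 \Big) \sum_{a:\gamma^a<\gamma^*} E[B^a(b(n))] \tag{36} \\
&\le Z_1 \ln n + Z_2
\end{align*}

where

\[ Z_1 = \Delta_{\max} \Big( \frac{1}{\Pi_{\min}} + M_{\max} + 1 \Big) \frac{4 N L H^2 a_{\max}^2}{\Delta_{\min}^2}, \]

\[ Z_2 = \Delta_{\max} \Big( \frac{1}{\Pi_{\min}} + M_{\max} + 1 \Big) \Big( N + \frac{\pi N H S_{\max}}{3 \pi_{\min}} \Big). \]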



Now we show our main results on the regret of CLRMR
policy as in Theorem 2.

Theorem 2: When using any constant $L \ge 56(H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$,
the regret of CLRMR can be upper
bounded uniformly over time by the following,

\[ \mathfrak{R}^{CLRMR}(n) \le Z_3 \ln n + Z_4 \tag{37} \]

where

\[ Z_3 = Z_1 + Z_5 \frac{4 N L H^2 a_{\max}^2}{\Delta_{\min}^2}, \]

\[ Z_4 = Z_2 + \gamma^* \Big( \frac{1}{\Pi_{\min}} + M_{\max} + 1 \Big) + Z_5 \Big( N + \frac{\pi N H S_{\max}}{3 \pi_{\min}} \Big) \]

and



Z 5 = 7max( 1 = r 1 — + Mnax + 1 - — — ) + 7* M n 



Proof: Denote the expectations with respect to policy 
CLRMR given £ by E^. Then the regret is bounded as, 

T(n) 

(34) ^ LRMR (n)= 1 *E c [T(n)}-E c [J2 E ^Xd)] 



t—1 i£A a ( t ) 

n 

+ 7 *E c [n-T(n)]-E c [ £ E o*(*)^(t)] 

t=T(n) + l »e-4 a( t) 

< ^7*E c [T(n)]-^ 7 a E c [T a (n)]^ + 7 *E c [n - T(n)] 

— , T( ™ } — , 

+ ^ 7 a Ec[T a (n)]-E c [^ ]T «*(*)<(*)] 

a t=l ie^«(t) 

< Z x In n + Z 2 + 7* (— *— + M max + 1) 



(38) 



^T 7 a E c [T»]-E c [£ £ a <('X(t) 

l t = l iS^la(t) 



(39) 



where (38) follows from Theorem 1 and $E_\zeta[n - T(n)] \le \frac{1}{\Pi_{\min}} + M_{\max} + 1$.



9 



E[B*(6)] < 



4Li? 2 a max 



lnt 2 (6) 



A 2 . 

mm 



t 2 (b) r -l 

E E 



E E- E 



2HS n 



(29) 



p i.a. 



Note that where 

T{n) 

5> a E<[T»]-E c []T ^Kd)} 
< 7 *E f [T»]+ 7 a E c [T»] 

a:7 a <7* 

B*(6(n)) 

- E E a W[ E E W = y)] 

ieA** yes* j i r t i ey < 0') 

B a (b(«)) 

- E E E«*i E E ictf = »)] 

(40) 

where the inequality above comes from counting only the rewards obtained in the sub-blocks $Y_2^i(j)$ instead of in the whole blocks in (40). Then, applying Lemma 2 to (40), we have

E C [ ]T J] l(^ = y)] = ^fEc[ J B a (6(n))]. 

So 

B a (6(n)) 

E E E«*i E E w=y)] 

<" E -^E c [i? a (6(n))]. (41) 

a ^ * "max 

Also note that 



\[ \sum_{a:\gamma^a<\gamma^*} \gamma^a E_\zeta[T^a(n)] \le \sum_{a:\gamma^a<\gamma^*} \gamma^a \left(\frac{1}{\Pi_{\min}} + M^a_{\max} + 1\right) E_\zeta[B^a(b(n))]. \tag{42} \]



Inserting (41) and (42) into (40), we get

T(n) 

^ 7 a E c [T a (n)]-E c [^ J] 

a t=l i£^ a(t) 

< 7 *E C [T»] 
+ E 7 a ( TT ^+M a ax + l-— )Ec[^ a Wn))] 

a:7 a <7 , li min "max 
B'(b(n)) 

- e e^m e e 

ie.A a « yes* i r/er 4 (i) 

= Q*(n) 

1 



E 7 a (^ 



A/ a ax + l-— )E c [S a (6(n))], 



Q*(n)= 7 *E c [T*(n)] 

- E E «^ E ci E ^ 

ieA** ySS ; 



B*(b{n)) _^ 

E W = y)] 



We now consider the upper bound for $Q^*(n)$. We note that the total number of time slots spent playing suboptimal arms is at most logarithmic, so the number of time slots in which the optimal arm is not played is at most logarithmic. We can then combine the successive blocks in which the best arm is played, and denote by $Y^*(j)$ the $j$-th combined block. Denote by $b^*$ the total number of combined blocks up to block $b$. Each combined block $Y^*$ starts after a discontinuity in playing the optimal arm, so $b^*(n)$ is less than or equal to the total number of completed blocks in which the best arm is not played up to time $n$. Thus,



\[ E_\zeta[b^*(n)] \le \sum_{a:\gamma^a<\gamma^*} E_\zeta[B^a(b(n))]. \tag{43} \]
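The counting argument behind this bound can be checked on a toy sequence of blocks. The sketch below is only an illustration under stated assumptions: the block sequence, the label `"*"` for the optimal arm, and the helper name `combined_blocks` are hypothetical, and the sequence is assumed to begin with a block of a suboptimal arm so that every combined block is preceded by at least one suboptimal block.

```python
from itertools import groupby


def combined_blocks(arm_per_block, optimal):
    """Merge consecutive blocks of the optimal arm; return (first, last) block indices of each run."""
    runs = []
    index = 0
    for arm, group in groupby(arm_per_block):
        length = len(list(group))
        if arm == optimal:
            runs.append((index, index + length - 1))
        index += length
    return runs


# Hypothetical sequence of arms, one entry per completed block.
blocks = ["a", "b", "*", "*", "a", "*", "*", "*", "b", "*"]
runs = combined_blocks(blocks, "*")
num_suboptimal = sum(1 for arm in blocks if arm != "*")
```

Here `runs` is `[(2, 3), (5, 7), (9, 9)]`, so there are three combined blocks against four suboptimal blocks, consistent with the inequality above.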



Each combined block $Y^*$ consists of two sub-blocks: $Y_1^*$, which contains the state vectors for the optimal arm visited from the beginning of $Y^*$ (empty if the first state is $\zeta^*$) to the state right before hitting $\zeta^*$, and $Y_2^*$, which contains the rest of $Y^*$ (a random number of regenerative cycles). Denote the length of $Y_1^*$ by $|Y_1^*|$ and the length of $Y_2^*$ by $|Y_2^*|$. We denote by $Y_2^i(j)$ the states of Markov chain $i$, for all $i \in \mathcal{A}_{a^*}$, in $Y_2^*$.

Therefore we get the upper bound for Q*(n) as 
Q» = 7 *E C [T»] 

_^ B'{b(n)) _^ 

- J2 E«^ E c[ E E = (44) 



i£A a * yGS ; 



3 YfEYi(j) 
b'(n) 



-MttIEJV \Y* 



^ E E ^v^ciz^ i j 2 

ieA** yes* i=i 

■ e e «? We e 



(45) 



i£A a * yeS* 



i =1 Yi^YiH) 



b*{n) 



j£-4 a * yeS' 



1(17=2/)] (46) 

(47) 
(48) 



a:7 a <7* 



a:7 a <7* 



where the inequality in (45) comes from counting only the rewards obtained in sub-block $Y_2^*$ in (44). Also, note that based on Lemma 2, (45) equals (46), and therefore we have the inequality (48).
Hence, $\forall \zeta$,

1 



CLR w i; 



(n) < Zi Inn + Z 2 + 7 *( + M max + 1) 



(49) 



Replacing ct, s with ^/ L ("(*)) lnt ; and replacing L with 
L(n(i 2 (6))) or L(n(r)) accordingly in the proof of Theorem 
[1] @ to ([fejl still stand. 

$L(n(\tau))$ is a diverging non-decreasing sequence, so there exists a constant $\tau'$ such that for all $\tau \ge \tau'$, $L(n(\tau)) \ge 56(H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, which implies



E 

a:7 a <7* 

max / j 
a:7 a <7* 



+ 7*M* 



7 a (M* ax + l)E f [B a (&(n))] 
E c [i? a (6(n))] 



< Zi In n + Z 2 + 7* ( - 



1 



1 



M max + 1) 
1 



-56ffS^. 
2SS max'max f ' 

Thus, we have 

E[£T(&)] < 
HS m , 



4L(n(i 2 (&)))iJ 2 a 2 nax lnn 



A 2 . 

mm 



(53) 



- 7 *M max )E c [B a (&(n))] 
J3 m n -f Z/4, (50) 
where (50) follows from Theorem 1 and (34), and

4jVLff 2 a max 
Z 3 - Zi + z 5 



Z 4 = Z 2 + 7* ( 



1 



n„ 



Zs is defined as 



Z b = 7max( 



n„ 



+ l) + Z 5 (iV + 



TrNHS n 



m 

Theorem 2 shows that when we use a constant $L \ge 56(H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, the regret of Algorithm 1 is upper-bounded uniformly over time $n$ by a function that grows as $O(N^3 \ln n)$. However, when $S_{\max}$, $r_{\max}$, $\hat{\pi}_{\max}$ or $\epsilon_{\min}$ (or bounds on them) are unknown, the upper bound on the regret cannot be guaranteed to grow logarithmically in $n$.

When no knowledge about the system is available, we extend the CLRMR policy to achieve a regret bounded uniformly over time $n$ by a function that grows as $O(N^3 L(n) \ln n)$, using any arbitrarily slowly diverging non-decreasing sequence $L(n)$ in Algorithm 1. Since $L(n)$ can grow arbitrarily slowly, this modified version of CLRMR, named CLRMR-LN, can achieve a regret arbitrarily close to logarithmic order. We present our analysis in Theorem 3.

Theorem 3: When using any arbitrarily slowly diverging non-decreasing sequence $L(n)$ (i.e., $L(n) \to \infty$ as $n \to \infty$), and replacing (3) in Algorithm 1 accordingly with



(51) 



where $n(t_2)$ is the time when the total number of time slots spent in SB2 is $t_2$, the expected regret under this modified version of CLRMR, named the CLRMR-LN policy, is at most

\[ \mathfrak{R}^{\mathrm{CLRMR\text{-}LN}}(n) \le Z_6 L(n) \ln n + Z_7, \tag{52} \]

where $Z_6$ and $Z_7$ are constants.
Proof: 



< 



4£(n)ff 2 a max hn 



nHSn 

37T mi 



Z 8 



where 



pro ' / Lt n 

1 1 O max x > n 

z 8 = 2^ lT 

T^miii . 

r— 1 

Then we accordingly have

£ (7*- 7 a )E[T»] 

a:7 a <7* 

< ZgL(n) In n + Z 2 + A ma 
where 



(54) 



(55) 



n„ 



1 



So 



where 



n„ 



CLRMR-LN 



M„ 



M„ 



ANH 2 a 



1 JVZ„. 



2„2 

max 



A 2 . 

mm 



9t' 



Zk — Zq 



(n) < ZeL(n) Inn 



(56) 



(57) 



Z: 



4iVi? 2 a 2 nax 



A 2 



Z 7 = Z 2 + 7* ( 



n„ 



i) 



Z 5 {N- 



n m i n 

irNHS n 



NZ 7 



(58) 



NZ 7 ). 



VI. Applications and Simulation Results 

We now present an evaluation of our policy over stochastic 
versions of two combinatorial network optimization problems 
of practical interest: stochastic shortest path (for routing), and 
stochastic bipartite matching (for channel allocation). 
A. Stochastic Shortest Path

In the stochastic shortest path problem, given a graph $G = (V, E)$ with edge weights $(D_{ij})$ stochastically varying with time as restless Markov chains with unknown dynamics, we seek a path between a given source $s$ and destination $t$ with minimum expected delay. We can apply the CLRMR policy to this problem, with some very minor modifications to the policy and the corresponding regret definition so that they apply to a minimization problem instead of a maximization problem.

For the stochastic shortest path problem, each path between $s$ and $t$ is mapped to an arm. Although the number of paths could grow exponentially with the number of Markov chains, $|E|$, CLRMR efficiently solves this problem with storage polynomial in $|E|$ and regret scaling as $O(|E|^3 \log n)$.




Fig. 2. A graph with 19 links and 260 acyclic paths between s and t for 
stochastic shortest path routing. 

We show the numerical simulation results for the graph in Figure 2. We assume each link has two states, with a delay of 0.1 on good links and 1 on bad links. Table III summarizes the transition probabilities on each link.



Link   p01, p10     Link   p01, p10     Link   p01, p10
e.1    0.2, 0.8     e.8    0.3, 0.8     e.15   0.1, 0.8
e.2    0.3, 0.9     e.9    0.1, 0.9     e.16   0.8, 0.1
e.3    0.2, 0.7     e.10   0.9, 0.1     e.17   0.2, 0.7
e.4    0.7, 0.1     e.11   0.3, 0.8     e.18   0.9, 0.1
e.5    0.3, 0.9     e.12   0.2, 0.7     e.19   0.3, 0.8
e.6    0.2, 0.7     e.13   0.8, 0.1
e.7    0.2, 0.8     e.14   0.4, 0.8

TABLE III
Transition probabilities
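The benchmark that CLRMR is compared against, a genie that knows the transition matrices and always uses the single best path, can be sketched as follows. This is a minimal illustration, not the authors' code: each link is given its stationary mean delay and a standard Dijkstra search is run over those weights. Several points are assumptions: state 0 is taken to be the good state (delay 0.1) and state 1 the bad state (delay 1), `p01` is read as the probability of moving from state 0 to state 1, and the four-node graph is a hypothetical toy topology, not the graph of Fig. 2.

```python
import heapq


def stationary_mean_delay(p01, p10, d_good=0.1, d_bad=1.0):
    """Mean delay of a two-state link under its stationary distribution."""
    pi_bad = p01 / (p01 + p10)  # assumed: state 1 is the bad state
    return (1.0 - pi_bad) * d_good + pi_bad * d_bad


def shortest_path(adj, source, target):
    """Dijkstra's algorithm; adj maps a node to a list of (neighbor, weight)."""
    dist = {source: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical toy graph with two s-t paths; (p01, p10) pairs are in the style of Table III.
links = {("s", "a"): (0.2, 0.8), ("a", "t"): (0.8, 0.1),
         ("s", "b"): (0.3, 0.9), ("b", "t"): (0.2, 0.7)}
adj = {}
for (u, v), (p01, p10) in links.items():
    adj.setdefault(u, []).append((v, stationary_mean_delay(p01, p10)))

best_delay, best_path = shortest_path(adj, "s", "t")
```

With these numbers the path through `a` has mean delay 0.28 + 0.9 = 1.18 and the path through `b` has 0.325 + 0.3 = 0.625, so the genie would pick `s, b, t`.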

Figure 3 shows the simulation results. We see that our
proposed CLRMR performs better than RCA, the algorithm 
presented in 1 1 1 for all L values considered. If we let 
$L = 1512$ in this problem, we have that $L \ge 56(H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$. For lower values of $L$, the analysis does not guarantee that the algorithms yield logarithmic regret. However, numerically, we find that both policies seem to achieve logarithmic regret, and yield much better regret performance, even for much smaller $L$ values. It is unclear whether this can be proved rigorously or whether it is due to low-probability events not captured in the simulations.
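The value $L = 1512$ can be reproduced from Table III under a few assumptions, which the sketch below makes explicit. It evaluates $56(H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$ with $S_{\max} = 2$ states and $r_{\max} = 1$ (the largest delay). The remaining choices are assumptions that are not stated in this section: $\hat{\pi}_{\max}$ is taken as the largest of $\pi$ and $1 - \pi$ over all stationary probabilities, $\epsilon_{\min}$ is taken as the smallest value of $1 - \lambda_2^2$ with $\lambda_2 = 1 - p_{01} - p_{10}$ the second eigenvalue of each two-state chain, and $H = 7$ is taken as the maximum number of links in a path. The function name `l_threshold` is hypothetical.

```python
# (p01, p10) for links e.1 to e.19, copied from Table III.
TABLE_III = [
    (0.2, 0.8), (0.3, 0.9), (0.2, 0.7), (0.7, 0.1), (0.3, 0.9),
    (0.2, 0.7), (0.2, 0.8), (0.3, 0.8), (0.1, 0.9), (0.9, 0.1),
    (0.3, 0.8), (0.2, 0.7), (0.8, 0.1), (0.4, 0.8), (0.1, 0.8),
    (0.8, 0.1), (0.2, 0.7), (0.9, 0.1), (0.3, 0.8),
]


def l_threshold(chains, H, s_max=2, r_max=1.0):
    """Evaluate 56 (H + 1) S_max^2 r_max^2 pi_hat_max^2 / eps_min for two-state chains."""
    pi_hat_max = 0.0
    eps_min = 1.0
    for p01, p10 in chains:
        pi1 = p01 / (p01 + p10)                # stationary probability of state 1
        pi_hat_max = max(pi_hat_max, pi1, 1.0 - pi1)
        lam2 = 1.0 - p01 - p10                 # second eigenvalue of the chain
        eps_min = min(eps_min, 1.0 - lam2 ** 2)  # assumed definition of the gap
    return 56 * (H + 1) * s_max ** 2 * r_max ** 2 * pi_hat_max ** 2 / eps_min


threshold = l_threshold(TABLE_III, H=7)  # H = 7 is an assumption
```

Under these assumptions `threshold` evaluates to 1512 up to floating-point rounding, since $\hat{\pi}_{\max} = 0.9$ and the largest $|\lambda_2|$ is 0.2. That the quoted value is recovered suggests the assumed reading of the constants is at least consistent with the paper's setup, but it should not be taken as confirmed.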

B. Stochastic Bipartite Matching for Channel Allocation 

As a second application, we consider a cognitive radio network where $M$ secondary users interfering with each other need to be allocated to $Q$ non-conflicting orthogonal channels. We assume that, due to geographic dispersion, each user may see different primary user occupancy behavior in each channel. The availability of spectrum opportunities on each user-channel combination $(i, j)$ over a decision period is modeled as a restless two-state Markov chain. It is easy to show that applying CLRMR to this problem yields storage linear in $MQ$, and a regret bound that scales as $O(\min\{M, Q\}^2 MQ \log n)$, following Theorem 2.

Fig. 3. Comparison of normalized regret $\mathfrak{R}(n)/\ln n$ vs. $n$ time slots for the stochastic shortest path problem. [Plot omitted; legend entries: RCA policy and CLRMR policy at several values of L, of which 1512, 50, and 2 are legible.]

Fig. 4. Comparison of normalized regret $\mathfrak{R}(n)/\ln n$ vs. $n$ time slots for the Stochastic Bipartite Matching / Channel Allocation Problem. [Plot omitted; legend entries: RCA policy and CLRMR policy at several values of L, of which 50 and 2 are legible.]

We show simulation results comparing CLRMR again with RCA for a system consisting of 9 orthogonal channels and 5 secondary users. The transition probabilities used for this scenario are presented in Table IV.





       ch.1      ch.2      ch.3      ch.4      ch.5      ch.6      ch.7      ch.8      ch.9
u.1    0.5,0.6   0.2,0.7   0.2,0.9   0.8,0.1   0.2,0.7   0.3,0.7   0.2,0.9   0.2,0.7   0.1,0.9
u.2    0.3,0.8   0.1,0.9   0.2,0.8   0.3,0.7   0.3,0.6   0.2,0.8   0.4,0.7   0.2,0.8   0.9,0.2
u.3    0.8,0.1   0.2,0.7   0.3,0.7   0.2,0.8   0.5,0.6   0.2,0.7   0.2,0.7   0.2,0.8   0.1,0.9
u.4    0.3,0.9   0.2,0.8   0.2,0.9   0.4,0.6   0.9,0.2   0.2,0.9   0.2,0.9   0.2,0.9   0.2,0.9
u.5    0.5,0.6   0.2,0.7   0.3,0.9   0.2,0.7   0.5,0.5   0.2,0.7   0.8,0.1   0.3,0.9   0.3,0.9

TABLE IV

Transition probabilities p01, p10 for each user-channel pair
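To make the single-best benchmark for this problem concrete, the sketch below computes the best fixed user-to-channel matching from Table IV. It is an illustration under assumptions, not the authors' procedure: state 1 is assumed to mean that the channel is available to the user (reward 1) and state 0 that it is not (reward 0), so each pair is weighted by its stationary availability $p_{01}/(p_{01}+p_{10})$. Brute-force enumeration is used only because the instance is small. The helper names are hypothetical.

```python
from itertools import permutations

# (p01, p10) for users u.1 to u.5 (rows) and channels ch.1 to ch.9 (columns), from Table IV.
TABLE_IV = [
    [(0.5, 0.6), (0.2, 0.7), (0.2, 0.9), (0.8, 0.1), (0.2, 0.7), (0.3, 0.7), (0.2, 0.9), (0.2, 0.7), (0.1, 0.9)],
    [(0.3, 0.8), (0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.3, 0.6), (0.2, 0.8), (0.4, 0.7), (0.2, 0.8), (0.9, 0.2)],
    [(0.8, 0.1), (0.2, 0.7), (0.3, 0.7), (0.2, 0.8), (0.5, 0.6), (0.2, 0.7), (0.2, 0.7), (0.2, 0.8), (0.1, 0.9)],
    [(0.3, 0.9), (0.2, 0.8), (0.2, 0.9), (0.4, 0.6), (0.9, 0.2), (0.2, 0.9), (0.2, 0.9), (0.2, 0.9), (0.2, 0.9)],
    [(0.5, 0.6), (0.2, 0.7), (0.3, 0.9), (0.2, 0.7), (0.5, 0.5), (0.2, 0.7), (0.8, 0.1), (0.3, 0.9), (0.3, 0.9)],
]


def stationary_availability(p01, p10):
    """Stationary probability of state 1 (assumed to mean 'channel available')."""
    return p01 / (p01 + p10)


def best_matching(weights):
    """Exhaustively search assignments of distinct channels to users."""
    num_users, num_channels = len(weights), len(weights[0])
    best_value, best_assignment = -1.0, None
    for channels in permutations(range(num_channels), num_users):
        value = sum(weights[u][c] for u, c in enumerate(channels))
        if value > best_value:
            best_value, best_assignment = value, channels
    return best_value, best_assignment


weights = [[stationary_availability(p01, p10) for p01, p10 in row] for row in TABLE_IV]
value, assignment = best_matching(weights)
```

Under the stated assumption, every user's best channel is distinct, so the result assigns u.1 to ch.4, u.2 to ch.9, u.3 to ch.1, u.4 to ch.5, and u.5 to ch.7 (`assignment == (3, 8, 0, 4, 6)` with zero-based channel indices), for a total stationary reward of about 4.30. For larger instances a polynomial-time maximum weight matching routine would replace the enumeration.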

The simulation results are shown in Figure 4. As in the stochastic shortest path problem, we find that CLRMR consistently outperforms RCA for all values of $L$. Here $L = 1135$ corresponds to ensuring that $L \ge 56(H+1) S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, which is when logarithmic regret is guaranteed in theory. However, again, we see that the performance seems to improve in practice with smaller $L$ values, even though this is not theoretically guaranteed.
VII. Conclusion

We have presented CLRMR, a provably efficient online learning policy for stochastic combinatorial network optimization with restless Markovian rewards. This algorithm is widely applicable to many networking problems of interest, as illustrated by our simulation-based evaluation of the policy over two different problems: stochastic shortest path and stochastic maximum weight bipartite matching.

One shortcoming of this work is that our focus has been on designing and evaluating the policy with respect to the best single-action policy. However, in general, with restless Markovian rewards, it is possible to further improve performance by developing an algorithm that dynamically switches between different actions over time as the underlying Markov chains evolve. Although this problem is much harder and remains unsolved except in a special case [16], we hope to investigate it further in our future work.